1. Introduction 

The analysis of algorithms that learn a rule from random examples is an 
active and fascinating topic in the area of statistical mechanics. For an overview see e.g. 
[1, 2, 3]. Many models in which examples are correctly classified by ideal experts (often 
called teachers) seem to be well understood. There is now a great deal of interest in 
nonideal but more realistic models, which incorporate the influence of different types 
of noise in learning. 

In this paper, we study a model where not all examples carry information about the 
unknown rule, but where a nonzero fraction of them are just outliers. Naively learning 
all examples may considerably deteriorate the ability to infer the rule in such a case. 
Similar to learning with noisy data, some knowledge about the stochastic data generating 
mechanism can be helpful. Based on such a stochastic model, a good algorithm could try 
to select the informative examples and discard the remaining ones. Since however only 
partial information is available, such a selection can only be performed approximately 
and it is natural to try a soft, probabilistic selection. 

Our model leads naturally to such a selection method. It consists of a classification 
problem, where data which come from two distributions (classes) centered at different 
points are mixed at random with outliers. A Bayesian approach, which aims at 
calculating the most probable values for the class centers by minimizing a specific 
training energy is combined with the so-called EM algorithm of Dempster et al 
which nicely deals with the problem of hidden parameters (the knowledge which of the 
data are informative) in data mixtures. This procedure leads to an algorithm which 
iteratively computes the probability that an example is informative and weights each 
example in predicting the unknown class centers of the data generating distributions. 
Our model may also be considered as a simple version of the mixtures of experts models 
|| which are frequently studied in the neural network literature. In these models, a 
complicated task is learnt by a division of labor among several simple learning machines 
(experts), where each expert learns from different subsets of examples. Our model 
would correspond to two experts where only one is able to extract information from the 
examples. 

The paper is organized as follows: After an introduction of the learning problem, 
two learning strategies are defined in section two. Section three gives the statistical 
mechanics formulation of the problem, which, based on a replica calculation, leads to 
a computation of the learning performance in the thermodynamic limit. In section 
four the algorithmic implementation of the learning methods using the EM algorithm 
is explained. Section five presents the results of the statistical mechanics calculations 
and of numerical simulations and concludes with a discussion. Details of the replica 
calculations are given in the appendices. 
2. The Learning Problem 

We assume that the examples $(\xi^\mu, S^\mu)$ with $\xi^\mu \in \mathbb{R}^N$ and $S^\mu \in \{\pm 1\}$, $\mu = 1, \ldots, \alpha N$, are 
generated alternatively by two different processes. For the first process, the input $\xi^\mu$ is 
selected at random from one of two Gaussian clusters (labelled by the outputs $S^\mu = \pm 1$) 
which are chosen with equal probability. The clusters are centered at $\pm B$ and have equal 
variance $1/\gamma$. $B$ is an $N$-dimensional vector with $B^2/N = 1$. The joint probability for 
inputs and outputs corresponding to this process can be written as 

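$$\mathcal{P}(\xi^\mu, S^\mu \mid B) \propto \exp\left[-\frac{\gamma}{2}\left(\xi^\mu - \frac{1}{\sqrt{N}}\, S^\mu B\right)^2\right].$$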

The data from this process represent classified examples in a noisy (because the Gaussian 
clusters overlap) two-class problem. 

In the second process, the inputs come from a single Gaussian centered at zero with 
the same variance, and the output (chosen $\pm 1$ with equal probability) is completely 
independent of the input. For this case, we make the ansatz 

$$\mathcal{P}(\xi^\mu, S^\mu \mid B) \propto \exp\left[-\frac{\gamma}{2}\left(\xi^\mu\right)^2\right].$$



The data from the second process may be understood as representing outliers which do 
not contain any information about the two spatially structured classes of inputs; they 
come from a "garbage" class and are classified purely by random guessing. In order to 
distinguish the two processes, we introduce decision variables $V^\mu \in \{0, 1\}$, where $V^\mu = 1$ 
stands for the first process and $V^\mu = 0$ for the outliers. The joint set of decision variables 
is denoted by $\{V^\mu\}_\mu$. Conditioning on these variables, we can write the probability 
distribution for the joint set of $\alpha N$ data $\mathbb{D} := \{(\xi^\mu, S^\mu)\}_\mu$, $\mu = 1, \ldots, p = \alpha N$, within 
the single equation 
In order to model the fact that outliers occur at random with a fixed rate, we will 
assume that both processes (structure, outliers) are chosen independently at random. 
The probability for having the value $V^\mu$ is written as 

$$\mathcal{P}(V^\mu) = \frac{\exp[-\eta V^\mu]}{1 + \exp[-\eta]}. \qquad (2)$$

Using the "chemical potential" $\eta$, we can adjust the average fraction of structured data 

$$\langle V^\mu \rangle = \frac{1}{\exp[\eta] + 1}.$$
For $\eta = -\infty$ all examples have $V^\mu = 1$, but with increasing $\eta$, fewer examples carry 
information. For $\eta = 0$, only half of the examples come from the structure, and for 
$\eta = \infty$ all examples are outliers. 
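
To make the role of $\eta$ concrete, the following Python sketch samples a data set from the two processes and evaluates the average fraction of structured examples. It is only an illustration under stated assumptions: the function names `informative_fraction` and `sample_data` are hypothetical, and the cluster centers are taken to be $\pm B/\sqrt{N}$, a scaling that is assumed here rather than stated in the text.

```python
import numpy as np

def informative_fraction(eta):
    """Average fraction of structured examples, 1 / (exp(eta) + 1)."""
    return 1.0 / (np.exp(eta) + 1.0)

def sample_data(N, alpha, eta, gamma, rng):
    """Draw p = alpha*N examples (xi, S) together with the hidden labels V.

    Assumption (not from the text): cluster centers are +-B/sqrt(N).
    """
    p = int(alpha * N)
    B = rng.standard_normal(N)
    B *= np.sqrt(N) / np.linalg.norm(B)            # enforce B^2 / N = 1
    # V = 1 marks a structured example, V = 0 an outlier.
    V = (rng.random(p) < informative_fraction(eta)).astype(int)
    S = rng.choice([-1, 1], size=p)                # outputs +-1 with equal probability
    noise = rng.standard_normal((p, N)) / np.sqrt(gamma)   # variance 1/gamma
    # Structured inputs are shifted by S*B/sqrt(N); outliers are pure noise.
    xi = noise + (V * S)[:, None] * B[None, :] / np.sqrt(N)
    return xi, S, V, B

rng = np.random.default_rng(0)
xi, S, V, B = sample_data(N=50, alpha=20, eta=0.0, gamma=10.0, rng=rng)
print(informative_fraction(4.0))   # fraction of informative examples at eta = 4
print(xi.shape, V.mean())
```

For $\eta = 4$ the fraction evaluates to roughly 1.8%, so almost all sampled examples are outliers.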

A learner tries to infer the vector $B$ from the $\alpha N$ examples and makes an estimate 
$J$ for $B$. We will assume that the fraction of outliers is known to the learner. Although 
our final results mostly deal with the case where the parameter $\gamma$ is also known 
precisely, we will be more general in the basic definitions and assume that the learner 
uses $\tilde{\gamma}$ instead, with $\tilde{\gamma} \neq \gamma$. Hence, if the $\{V^\mu\}_\mu$ were known, the likelihood of the data 
based on the estimate $J$ would be given by 
In general, however, the learner does not know which of the examples contain 
information and which are outliers. To the learner, the $\{V^\mu\}_\mu$ are therefore hidden variables 
which are not observed but need to be averaged over. Hence, the actual ansatz for the 
distribution of data will be given by the mixture distribution 



where 
and where we have defined 



[l + exp[-rj\) 
One possible way of getting an estimate for the unknown vector $B$ would be the 
maximum likelihood method, i.e., one would use the vector $J$ which maximizes the 
likelihood (3). A second possibility is given by a Bayesian approach, where the learner 
supplies some prior knowledge about reasonable estimates $J$ within a prior distribution. 
We will use a distribution which on average gives the correct length of the unknown 
vector but does not favour any spatial direction 



$$\mathcal{P}(J) = \frac{1}{(2\pi)^{N/2}} \exp\left[-\frac{1}{2} J^2\right]. \qquad (5)$$
Based on the prior and the likelihood of the data, the learner can construct the posterior 
distribution, using Bayes' rule 

$$\mathcal{P}(J \mid \mathbb{D}) = \frac{\mathcal{P}(\mathbb{D} \mid J)\, \mathcal{P}(J)}{\mathcal{P}(\mathbb{D})}. \qquad (6)$$
There are several ways of using the information contained in the posterior (6). For example, 
simply taking the posterior mean as the estimate for $B$ will minimize the expected 
average (with respect to the posterior) squared error. Unfortunately, for a 
high-dimensional space, such expectations will not be easy to calculate exactly, and one 
has to resort to Monte Carlo sampling. A simpler estimate, which should not perform 
too poorly, is given by the vector $J$ which has maximal a posteriori probability (MAP), 
i.e., the one which maximizes (6). Actually, if there are enough data available, one can 
expect that the posterior will be close to a Gaussian, and both estimates will be close to each other. 

In order to maximize the posterior $\mathcal{P}(J \mid \mathbb{D})$ with respect to $J$, we can equivalently 
minimize the "training" energy function 

$$\mathcal{H}(J) = -\ln \mathcal{P}(\mathbb{D}, J) = -\ln \sum_{\{V^\mu\}} \mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu \mid J)\, \mathcal{P}(J). \qquad (7)$$

As we will see in section four, there is a simple algorithm to calculate the MAP, which 
is based on a recursive estimation of the (posterior) expected 
decision variables $\{V^\mu\}_\mu$. Since examples will be weighted by their probability of being 
informative rather than being kept or discarded from the training set, we call this 
method a soft selection of examples. 

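As an alternative to the MAP approach for $J$, we will also discuss an algorithm 
which calculates the MAP for the hidden variables $\{V^\mu\}_\mu$. Since these variables take 
the values 0 and 1 only, the result will be a hard selection of informative examples, 
rather than a soft weighting. We look for the values of $\{V^\mu\}_\mu$ which maximize 

$$\mathcal{P}(\{V^\mu\}_\mu \mid \mathbb{D}) = \frac{\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu)}{\mathcal{P}(\mathbb{D})}. \qquad (8)$$

Equivalently, we can maximize the numerator of this expression, which can be written 
as a mixture probability 

$$\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu) = \int dJ\, \mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J), \qquad (9)$$

resulting in a training energy 

$$\mathcal{H}_h(\{V^\mu\}_\mu) = -\ln \int dJ\, \mathcal{P}(\mathbb{D}, J, \{V^\mu\}_\mu). \qquad (10)$$

Finally, after minimization, we can use the expectations 

$$\langle J_j \rangle = \frac{\int dJ\, J_j\, \mathcal{P}(\mathbb{D}, J, \{V^\mu\}_\mu)}{\int dJ\, \mathcal{P}(\mathbb{D}, J, \{V^\mu\}_\mu)} \qquad (11)$$

as an estimate for the unknown $B_j$. 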



3. Analysis by Statistical Mechanics 



In this section, we study the performance of both MAP estimates analytically in the 
thermodynamic limit $N \to \infty$ using a statistical mechanics framework. We begin 
with the soft selection. There are different ways of measuring how well the learner, 
equipped with the MAP estimate, has learnt the structured distribution. An obvious 
idea is to measure the quadratic deviation between the true vector $B$ and the MAP: 

$$\Delta = \frac{1}{N} (J - B)^2 = Q - 2R + 1, \qquad (12)$$

where we have defined the order parameters 

$$Q = \frac{1}{N} J^2, \qquad R = \frac{1}{N} J \cdot B. \qquad (13)$$

It is also useful to calculate the angle $\Phi = \angle(J, B)$ between the estimate and $B$. This angle 
$\Phi$, normalized by $1/\pi$, is given in terms of the order parameters by 

$$\Phi = \frac{1}{\pi} \arccos \frac{R}{\sqrt{Q}}. \qquad (14)$$

(15) 
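
As a small numerical illustration of the two error measures, the sketch below computes $Q$, $R$, $\Delta$ and $\Phi$ for a given estimate. The function name `error_measures` is a hypothetical choice, and the expression used for $\Phi$, namely $\frac{1}{\pi}\arccos(R/\sqrt{Q})$, is the angle between $J$ and $B$ divided by $\pi$ under the assumption $B^2/N = 1$.

```python
import numpy as np

def error_measures(J, B):
    """Return (Q, R, Delta, Phi) for an estimate J of the vector B (B^2 = N)."""
    N = len(B)
    Q = J @ J / N                      # squared length of the estimate per dimension
    R = J @ B / N                      # overlap with the true vector
    delta = Q - 2.0 * R + 1.0          # equals (J - B)^2 / N when B^2 / N = 1
    # Angle between J and B, normalized by pi; clip guards against rounding.
    phi = np.arccos(np.clip(R / np.sqrt(Q), -1.0, 1.0)) / np.pi
    return Q, R, delta, phi

B = np.ones(4)                          # satisfies B^2 / N = 1
J = np.array([1.0, 1.0, -1.0, -1.0])    # an estimate orthogonal to B
print(error_measures(J, B))
print(error_measures(B, B))
```

An estimate orthogonal to $B$ gives $\Phi = 1/2$, while a perfect estimate gives $\Delta = 0$ and $\Phi = 0$.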

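The order parameters for the soft selection MAP algorithm can be derived from a 
partition function $Z$ where the corresponding Hamiltonian is given by $\mathcal{H}(J)$ from (7). 
Assuming that the inverse temperature $\beta$ is an integer, we define 

$$\begin{aligned}
Z &= \int dJ\, \exp[-\beta \mathcal{H}(J)] \\
  &= \int dJ\, \exp[\beta \ln \mathcal{P}(\mathbb{D}, J)] \\
  &= \int dJ\, \left(\mathcal{P}(\mathbb{D}, J)\right)^\beta \\
  &= \int dJ \left[\sum_{\{V^\mu\}} \mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J)\right]^\beta. \qquad (16)
\end{aligned}$$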
The MAP, which is the minimum of the energy $\mathcal{H}(J)$, is derived from the limit $\beta \to \infty$. 
The case $\beta = 1$ would correspond to Gibbs learning, where a vector $J$ is drawn at 
random from the posterior. As usual, order parameters are found from an average of 
the free energy $f = -\frac{1}{\beta N} \ln Z$ over the distribution of the examples. To perform the 
average, we utilize the replica trick 

$$\langle f \rangle = -\frac{1}{\beta N} \lim_{n \to 0} \frac{\partial}{\partial n} \ln \langle Z^n \rangle, \qquad (17)$$
where $\langle \ldots \rangle$ denotes the average over the distribution (see (1) and (2)) 

1 




The replicated partition function is now written as 

$$Z^n = \sum_{\{V^\mu_{ab}\}} \int \prod_a dJ_a \prod_{a,b} \mathcal{P}(\mathbb{D}, \{V^\mu_{ab}\}_\mu, J_a), \qquad (18)$$

where the decision variables contain two replica indices. Here, the index $a$ runs from 
1 to $n$, whereas $b$ runs from 1 to $\beta$. For the subsequent calculations we have assumed 
the correct parameters $\tilde{\gamma} = \gamma$ and have made a replica symmetric ansatz with respect to 
the indices $a$. We think that this should be at least a good approximation, because our 
model is an example of a teacher-student learning scenario, where student and teacher 
match in the sense that the student uses the right statistical model for the data. For the 
Gibbs learning scenario ($\beta = 1$), where the symmetry of student and teacher becomes 
perfect in the replica calculation (this can be seen by introducing a further average over 
$B$, using the prior (5)), replica symmetry is usually considered to be exact (however, no 
general proof has been given so far). Hence, assuming that the effects of replica symmetry 
breaking are small, we have refrained from performing a replica stability analysis. 

The treatment of the replica indices $b$ is much simpler, because the order parameters 
(see Appendix A) do not depend on them. Hence, as long as $\beta$ is an integer, no further 
symmetry assumptions are required for the $b$'s. Although we do not have a proof that 
the continuation to noninteger $\beta$ is unique, we expect that the limit $\beta \to \infty$ exists and 
can be safely calculated using a sequence of integers. 

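The hard selection problem of decision variables is treated similarly using the (zero 
temperature) free energy which is defined from the partition function 

$$Z_h = \sum_{\{V^\mu\}} e^{-\beta \mathcal{H}_h(\{V^\mu\}_\mu)} \qquad (19)$$

with the energy (10). The averages which are necessary for the calculation of error 
measures, e.g. 

$$\Phi = \frac{1}{\pi} \arccos \frac{\sum_j \langle J_j \rangle B_j}{\sqrt{N \sum_j \langle J_j \rangle^2}}, \qquad (20)$$

can be found in a standard way from derivatives of the free energy with respect to 
appropriate external fields, e.g. 

$$\frac{1}{N} \sum_j \langle J_j \rangle B_j = -\lim_{\lambda \to 0} \frac{\partial}{\partial \lambda} \lim_{\beta \to \infty} \frac{1}{\beta N} \ln \sum_{\{V^\mu\}} e^{-\beta \mathcal{H}_h(\{V^\mu\}_\mu, \lambda)}, \qquad (21)$$

where 

$$\mathcal{H}_h(\{V^\mu\}_\mu, \lambda) = -\ln \int dJ\, \mathcal{P}(\mathbb{D}, J, \{V^\mu\}_\mu) \exp\Big[-\lambda \sum_j J_j B_j\Big].$$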



Explicit calculations of the free energies and order parameters for both cases are given 
in the appendices. 

4. The EM Algorithm 

Unfortunately, the maximization of the posterior distributions cannot be carried out in 
closed form and must be done numerically. Usually, nonlinear optimization problems are 
solved by gradient descent algorithms which require a tuning of the step sizes. However, 
for the type of (generalized) maximum likelihood problem for mixture distributions such 
as (3) and (9), there is a simpler and well-known algorithm which was developed by 
Dempster et al. This so-called expectation maximization (EM) algorithm guarantees 
that the (generalized) likelihood is nondecreasing in every iteration step and converges 
to a local maximum. To explain the idea for the soft selection problem, let us assume 
for the moment that the hidden variables $\{V^\mu\}_\mu$ were actually known. Then the 
corresponding log-likelihood $\ln[\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu \mid J)\, \mathcal{P}(J)]$ could be maximized in closed 
form. In the EM algorithm, the true values of the hidden variables are replaced 
iteratively by suitable averages. At iteration $i$, in the expectation step, the function 

$$\Lambda(J, J^{(i)}) := \left\langle \ln\left[\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu \mid J)\, \mathcal{P}(J)\right] \right\rangle_{\mathcal{P}(\{V^\mu\}_\mu \mid \mathbb{D}, J^{(i)})} \qquad (22)$$

is calculated, which is the log-likelihood of observed and hidden data averaged over 
the posterior distribution of the hidden data, given the old estimate $J^{(i)}$. In the 
maximization step, (22) is maximized with respect to $J$ in order to obtain the new 
iterate $J^{(i+1)}$. 

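We will not give the proof of convergence here, as it is relatively simple and can be 
found in many textbooks. However, we can easily see that a fixed point of 
the algorithm is also a local extremum of (7). At the maximum of (22), we have 

$$0 = \frac{\partial}{\partial J} \Lambda(J, J^{(i)}) = \left\langle \frac{\partial}{\partial J} \ln\left[\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu \mid J)\, \mathcal{P}(J)\right] \right\rangle_{\mathcal{P}(\{V^\mu\}_\mu \mid \mathbb{D}, J^{(i)})} = \sum_{\{V^\mu\}} \frac{\frac{\partial}{\partial J} \mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J)}{\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J)}\, \frac{\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J^{(i)})}{\mathcal{P}(\mathbb{D}, J^{(i)})}.$$

Hence, at the fixed point, where $J^{(i)} = J$, we also have $\frac{\partial}{\partial J} \ln \mathcal{P}(\mathbb{D}, J) = 0$. For the 
explicit calculation, we need the conditional distribution of the hidden variables, given 
the data and $J$, 

$$\mathcal{P}(\{V^\mu\}_\mu \mid \mathbb{D}, J) = \frac{\mathcal{P}(\mathbb{D}, \{V^\mu\}_\mu, J)}{\mathcal{P}(\mathbb{D}, J)}.$$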



Using the distribution (2), we get 
d 



■PQD, J) 
exp[-^/ M (J)] 
which gives 



where 



(24) 



(V) = e wv I ^ J(i) 
(25) 



Hence, the estimate $J$ for $B$ has the form of a weighted Hebbian sum, where each 
example has a weight which is proportional to the estimated probability $\langle V^\mu \rangle$ that the 
example is not an outlier. It is interesting to look at the limiting case $\eta \to -\infty$, i.e. 
where all examples are from the double cluster and where no outliers are present. In 
this case, the EM iteration stops after one step, and we get 

$$\langle V^\mu \rangle = 1 \quad \text{for all } \mu, \qquad J = \frac{1}{\sqrt{N}}\, \frac{\sum_\mu S^\mu \xi^\mu}{\alpha + 1/\tilde{\gamma}}, \qquad (26)$$

which is the usual Hebbian vector. 
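
The soft selection iteration can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes cluster centers at $\pm B/\sqrt{N}$, a Gaussian prior proportional to $\exp(-J^2/2)$, and a learner that knows $\gamma$ and $\eta$. Under these assumptions the expectation step gives the weights $\langle V^\mu \rangle = 1/(1 + \exp[\eta - \gamma S^\mu \xi^\mu \cdot J/\sqrt{N} + \gamma J^2/(2N)])$, and the maximization step is the weighted Hebbian sum $J = N^{-1/2} \sum_\mu \langle V^\mu \rangle S^\mu \xi^\mu / (N^{-1} \sum_\mu \langle V^\mu \rangle + 1/\gamma)$. The function names `hebb` and `soft_em` are hypothetical.

```python
import numpy as np

def hebb(xi, S, gamma):
    """Plain Hebbian vector: all examples weighted equally."""
    p, N = xi.shape
    return (S @ xi) / np.sqrt(N) / (p / N + 1.0 / gamma)

def soft_em(xi, S, eta, gamma, n_iter=100, rng=None):
    """EM iteration for the MAP estimate J with soft example weights.

    Assumptions (not from the text): centers +-B/sqrt(N), prior exp(-J^2/2).
    """
    p, N = xi.shape
    rng = np.random.default_rng() if rng is None else rng
    J = rng.standard_normal(N)                      # random initial condition
    for _ in range(n_iter):
        # E-step: posterior probability that each example is informative.
        field = gamma * S * (xi @ J) / np.sqrt(N)
        arg = eta - field + gamma * (J @ J) / (2.0 * N)
        w = 1.0 / (1.0 + np.exp(np.clip(arg, -700.0, 700.0)))
        # M-step: weighted Hebbian sum.
        J_new = ((w * S) @ xi) / np.sqrt(N) / (w.sum() / N + 1.0 / gamma)
        converged = np.allclose(J_new, J, rtol=0.0, atol=1e-10)
        J = J_new
        if converged:
            break
    return J, w

# Outlier-free toy data: with eta strongly negative, every weight is 1.
rng = np.random.default_rng(1)
N, p, gamma = 20, 400, 10.0
B = np.ones(N)
S = rng.choice([-1, 1], size=p)
xi = S[:, None] * B / np.sqrt(N) + rng.standard_normal((p, N)) / np.sqrt(gamma)
J, w = soft_em(xi, S, eta=-50.0, gamma=gamma, rng=rng)
print(np.allclose(J, hebb(xi, S, gamma)))
```

For strongly negative $\eta$ all weights equal 1 and the iteration reproduces the Hebbian vector, in line with the limiting case discussed above.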

Similarly, to apply the EM algorithm to the hard selection problem with the mixture 
distribution (9), we take $J$ as the hidden quantity. In each iteration step, we have to 
maximize 

MiVUiW}®) ■= (lnV(V,{V^,J)) p ,^ 

2 W v N ^ 

-^E^/^E^E^ 2 } (27) 
with respect to $\{V^\mu\}_\mu$. Defining 



a -=lyv^ {i) + 1 
TV ^ 




(28) 



we obtain for the expectations at step $i$ 



a 



(29) 




Finally, after convergence, we use $\langle J_j \rangle$ as an estimate for $B_j$. 
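
A corresponding sketch for the hard selection is given below. It is an illustration under explicit assumptions (cluster centers $\pm B/\sqrt{N}$, prior proportional to $\exp(-J^2/2)$, known $\gamma$ and $\eta$, and an initial selection that keeps all examples), not the authors' code. With these assumptions the posterior of $J$ for fixed $\{V^\mu\}_\mu$ is Gaussian, so $\langle J_j \rangle$ and $\langle J_j^2 \rangle$ are available in closed form, and each $V^\mu$ is set to 1 exactly when its coefficient in the averaged log-probability is positive. The function name `hard_em` and the variable `a`, which may be normalized differently from the quantity defined in (28), are hypothetical.

```python
import numpy as np

def hard_em(xi, S, eta, gamma, n_iter=100):
    """EM iteration for the MAP of the decision variables V (hard selection).

    Assumptions (not from the text): centers +-B/sqrt(N), prior exp(-J^2/2),
    and an initial selection V = 1 for all examples.
    """
    p, N = xi.shape
    V = np.ones(p, dtype=int)
    for _ in range(n_iter):
        # E-step: Gaussian posterior of J given the current selection V.
        a = V.sum() / N + 1.0 / gamma
        J_mean = ((V * S) @ xi) / np.sqrt(N) / a
        J_sq = J_mean @ J_mean / N + 1.0 / (gamma * a)     # <J^2> / N
        # M-step: keep an example iff its coefficient in <ln P> is positive.
        coeff = -eta + gamma * S * (xi @ J_mean) / np.sqrt(N) - 0.5 * gamma * J_sq
        V_new = (coeff > 0).astype(int)
        if np.array_equal(V_new, V):
            break
        V = V_new
    a = V.sum() / N + 1.0 / gamma
    return V, ((V * S) @ xi) / np.sqrt(N) / a

rng = np.random.default_rng(2)
N, p, gamma = 20, 400, 10.0
B = np.ones(N)
S = rng.choice([-1, 1], size=p)
xi = S[:, None] * B / np.sqrt(N) + rng.standard_normal((p, N)) / np.sqrt(gamma)
V_all, J_all = hard_em(xi, S, eta=-50.0, gamma=gamma)    # learner expects no outliers
V_none, J_none = hard_em(xi, S, eta=50.0, gamma=gamma)   # learner expects only outliers
print(V_all.sum(), V_none.sum())
```

The two calls probe the extreme settings: for strongly negative $\eta$ every example is kept, and for strongly positive $\eta$ the iteration ends in the trivial solution where all data are treated as outliers and the estimate vanishes.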

5. Results and Discussion 

5.1. Soft Selection 

Solving for the order parameters and assuming that $\tilde{\gamma} = \gamma$, we find that for fixed $\eta$, as 
expected, both error measures $\Phi$ and $\Delta$ decrease towards 0 with an increasing number 
$\alpha N$ of examples, showing that the algorithm is able to find the true structure vector $B$. 
Since for the EM algorithm both error measures show qualitatively the same behaviour, 
we will concentrate mainly on the angle $\Phi$. 

Fig. 1 shows $\Delta(\alpha)$ for $\eta = 0$. The second curve gives the performance of the 
Hebbian rule (26). It demonstrates the importance of selecting informative examples. 
If all examples are weighted equally (and $\eta \neq -\infty$), then the true vector $B$ cannot be 
recovered for $\alpha \to \infty$. In Fig. 2, $\Phi(\alpha)$ (EM algorithm) is shown for $\eta = 0$ and $\eta = 4$. 
Since it was harder to perform simulations for $\eta = 4$, where only about 1.8% of the 
examples are informative, we have shown simulations only for $\eta = 0$. Asymptotically 
one finds a decrease of the error like 

(30) 

where $R_\infty$ is the asymptotic value of the order parameter $R$ and both $R_\infty$ and $c$ depend 
on $\eta$. 

As expected, for fixed $\alpha$, the error increases with $\eta$, i.e. with a growing number of 
outliers. More interesting is the nonsmooth behaviour of the second curve, which shows 
a sudden drop of the error as $\eta$ is varied. This phase transition can be observed in more 
detail in the relief plot of the order parameters $R$ and $Q$ in Figs. 3a and 3b. In regions of 
large $\eta$ or large $\alpha$, the saddlepoint equations have three solutions. Taking the solution 
with the smallest free energy leads to a jump of the order parameters. It is easier to 
investigate the transition by simulations as a function of $\eta$, for fixed $\alpha$. This is shown 
in Fig. 4, together with the predictions of the theory. 

We have simulated the EM algorithm starting from random initial conditions and 
averaged the order parameters over many samples of random inputs. For fixed $\alpha$, the 
simulations show good agreement with the theory for small and large values of $\eta$, 
but discrepancies show up close to the predicted transition. Since the average fraction 
$V$ of informative data points decreases exponentially with $\eta$, finite size effects play a 
crucial role in the simulations. For example, for $\eta = 4$, fewer than 2 examples out of $N = 100$ are 
informative on average, whereas the replica theory is based on infinitely many examples 
from the structured clusters. Hence, we have performed a finite size scaling to determine 
the critical value $V_0$ where the transition sets in. Since for small $\eta$ (large $V$) the 
simulations show rather small statistical fluctuations around a value of $R$ close to 1, we 
have (for each $N$) defined $V_0$ as the point where the distribution of the observed values 
for $R$ significantly broadens, indicating the onset of transitions to different values of $R$. 
A simple linear extrapolation to $N = \infty$, as shown in the inset of Fig. 4, gives a value for 
$V_0$ which is in good agreement with the predicted value for the phase transition. The 
large error bar at $\eta = 6.8$ is explained by the fact that the values for $\Phi$ (eq. (14)) 
have been obtained by using the sample averages of $R$ and $Q$, which (for finite $N$) show 
a transition at slightly different values of $\eta$. 



5.2. Hard Selection 

Solving the order parameter equations for the free energy (B3) at zero temperature, 
we find similar first-order transitions as for the method of soft selection. For $\eta$ small 
enough, there is only one solution, which has a nonzero overlap with the teacher vector $B$. 
Increasing $\eta$ (and thereby the number of outliers) beyond a value $\eta_0$, another solution 
with $R = Q = z = 0$ (see eq. (B3)) appears, i.e. where all $V^\mu = 0$ and all data are 
considered to be outliers. Here 

» =-!+£■ pi) 

Between $\eta_0$ and a second parameter value $\eta_c$, however, this trivial solution has a higher 
free energy $f_h = 0$ than the nontrivial one. Finally, for $\eta > \eta_c$, the trivial solution with 
zero order parameters, giving rise to $\Phi = 1/2$, is the one with lowest free energy. Fig. 5 
shows this critical $\eta$ as a function of $\alpha$. 

So, unlike in the soft selection case, we have, for a large range of $\eta$, two solutions 
of the order parameter equations. This is reflected in the simulations, where single runs 
clearly tend to either of these two optima. Effects of metastability (which would 
be a sign of a rugged energy landscape and indicate strong effects of replica symmetry 
breaking) could not be observed. However, a finite size scaling for the transition point 
did not lead to a satisfactory agreement with the theory. We think that the observed 
discrepancy is a dynamical effect, where the EM algorithm, starting from a random 
initial condition, is unable to reach the global minimum and converges only to the local 
one, thus shifting the phase transition to smaller values of $\eta$. We have compensated for this 
effect to some extent by keeping only those simulations (as long as they occur) where 
the EM algorithm converges to the solution with nonzero overlap with the vector $B$. 

Fig. 6 shows the performance of the hard selection for $\alpha = 20$. Comparison with 
Fig. 4 suggests that the soft selection should be preferred. The difference between the 
performance of the two algorithms becomes more drastic for $\alpha \to \infty$: the soft selection 
algorithm is able to tolerate an arbitrary fraction of outliers as long as enough data are 
available. Eventually, it will always find the true teacher vector $B$. On the other hand, 
for hard selection, the explicit solution of the order parameter equations for $\alpha \to \infty$ 
shows that there is always a critical fraction of outliers (corresponding to a parameter 
$\eta_c$, see (B5)) beyond which learning is no longer possible even if infinitely many examples are 
available. It is also interesting to investigate the influence of the overlap of the two 
Gaussian clouds in the structured input distribution on the transition parameter $\eta_c$. 
Fig. 7 shows $\eta_c$ for $\alpha = \infty$ as a function of $\gamma$, which gives the inverse squared width of 
each Gaussian and thus measures the distinguishability of the clouds. If $\gamma$ is below 0.278, 
somewhat surprisingly, the critical $\eta$ jumps discontinuously to zero, i.e. if the overlap of 
the two clouds is above a certain value, only 50% outliers can be tolerated. 

Phase transitions in the performance of learning algorithms have been observed 
frequently in the statistical mechanics of neural networks. Since such effects do not 
occur in asymptotic (in the sense of large $\alpha$) expansions or in the exact bounds known 
in statistics, they seem to be one of the major contributions of statistical mechanics 
to the field of computational learning theory. Phase transitions occur in multilayer 
networks, where they can be related to the breaking of symmetries which are related 



to the network architecture [12, 11 1. Other examples include models with a so-called 



student teacher mismatch models with discrete adjustable parameters |7|, § and 



models of unsupervised learning || |K|. For the present supervised learning model, 
where the basic adjustable parameters are continuous variables and where the learner 
matches the distribution of the data, the phase transition was unexpected. It 
will be interesting to apply recently developed combinations of statistical mechanics 
techniques and methods of information theory [14] to establish the existence of phase 
transitions in mixture models in more general circumstances. 
Appendix A. Free energy and order parameters for soft selection 

Upon averaging, we obtain 
$\langle Z^n \rangle =$ 
In this expression (and in the following one) the order parameters have to be taken at 
their saddle point values. After a lengthy calculation, we arrive at an expression for the 
free energy 

1 = \m^5 - h HQ - q)+ \ Q - a t' m q ' Q) + const ' (A1) 
with 

For $\beta \to \infty$ we have to take the limit $q \to Q$. With the ansatz $\beta(Q - q) =: z = O(1)$, 
we get in the limit 

This yields the saddlepoint equations 

() ± 9/ _ R _ ae"" 
~~ aR ~~ 7 ~ 1 + e^ 



f ft f \ ae"'' 

, 5 2 )~T+1F' 



-/1 j+const.(A2) 



Q ^ df = Q - R 2 



a f 7 2 



ae 



v ^2 



9/ 

dQ 



2z 2 l + e~ r >" ± 2-f l + e~ r >" ± 2-f 



2z 2 1 + e-" \ 2 



ae 



7 



l + e- 1 ? V 2 



^ J 6 + -I 2 + 



7 



2^ 



h+ 2 h , 



h+ 2 h , 



where 



h 
h 
h 

h 

h 
h 
h 



:= / Dx- 



-2a 



Dx- 



+ 1 + (2 - b)e- a 
2e- 2a + (2 - b)e- a 



-2a 



Dx 
Dx 



+ 1 + (2 - b)e- 
2e" 2a + (2 - b)e~' 
( e -2o + 1 + ( 2 - 6) e - a )" 

e -2a , j , 2e -a 



-2a 



1 + (2 - 6)e- a ) z 



y L>xm(l + e a ) 

/ Dx — - 

J e- a + 1 

/ 



'e- a + 1 



15 



For the Ij, a has to be replaced by a, where 
Appendix B. Free energy and order parameters for hard selection 

The Hamiltonian (10) is explicitly given by 

$$\mathcal{H}_h(\{V^\mu\}_\mu) = -\ln \int dJ\, \mathcal{P}(\mathbb{D}, J, \{V^\mu\}_\mu)$$

Averaging the partition function (19) yields 

The free energy simplifies in the limit $\beta \to \infty$, where the scaling $\beta(q - Q) =: z = O(1)$ 
is used. We finally obtain $f_h$ as a function of the actual order parameters at the 
saddlepoint: 

A = ,Q + i ln( ^ + 1) _W±^f)Wi±i)22! ( B3) 

2 4(^77 + ^7 2 + 7) 2 



A similar calculation using (21) yields the averages 

E (JM = nJ^- ( 1 - ' 

j Ql + 1 V Q77 + ^7 2 + 7, 



■> 2(g 7 7 + z7 2 + 7 ) 2 Q1 + 1 

In the limit $\alpha \to \infty$, the resulting order parameter equations can be further simplified 
by making the scaling ansätze $R = \alpha R_0$, $Q = \alpha Q_0$, $z = -\alpha z_0$, where $R_0, Q_0, z_0$ are 
independent of $\alpha$ as $\alpha \to \infty$. For $\tilde{\gamma} = \gamma$, the equation for the critical ratio of outliers 
$\eta_c$, where the trivial solution with zero order parameters has the global minimum of the 
free energy, is determined from 



= 77 - 277rr?exp[7 + 2rf\<$> 2 [^- ^2rj]/ ^exp[^2^r]} + exp[7/2 + rj] (B5) 
+0^exp[ 7 /2 + V ] (-2$^ 2e^ + 2e^[^]\ }' . 
Figure 1. Comparison between EM algorithm and naive Hebb rule. Parameters are 
$\eta = 0$, $\tilde{\gamma} = \gamma = 10$. The solid lines show the theoretical results. Simulations are done 
with $N = 500$; here as in subsequent plots, bars mark standard deviations over 100 

Figure 2. $\Phi(\alpha)$ for $\eta = 0$ and $\eta = 4$, respectively (MAP estimate). The simulations 
at $\eta = 0$ are performed with $N = 500$; results are averaged over 100 runs. 

Figure 3. Order parameters $R(\alpha, \eta)$ (top) and $Q(\alpha, \eta)$ (bottom) for MAP. As in Fig. 2 
and subsequent plots, we set $\tilde{\gamma} = \gamma = 10$. 

Figure 4. Error $\Phi$ for soft selection versus amount of outliers, represented by $\eta$. 
The relative number of data is fixed at $\alpha = 20$. The dashed part of the theoretical 
curve denotes the region where three solutions of the saddlepoint equations exist. 
The solid line follows the solution with minimal free energy. Simulations are results 
from 100 runs with $N = 100$. Note that, for finite $N$, the transitions of the two 
order parameters do not coincide. The error measure $\Phi$ follows roughly the overlap $R$ 
between solution vector and structure axis, whereas the drop in $Q$ gives rise to the 
increased standard deviation at $\eta = 6.8$. The inset shows a finite size scaling of the 
phase transition as described in the text. The corresponding dimensions of the data 
are $N = 10, 25, 50, 100, 1000$ respectively. 

Figure 5. Critical fraction of outliers for hard selection as a function of $\alpha$ (solid line). 
The dashed line represents the phase transition for soft selection. 

Figure 6. Error $\Phi$ for hard selection versus amount of outliers, represented by $\eta$. As 
in Fig. 4, $\alpha = 20$ and simulations are performed with $N = 100$ and 100 runs. The 
solid line indicates the theoretical result for the global optimum, the dashed line for 
the local one. 

Figure 7. Asymptotic critical fraction of outliers for hard selection, plotted against 
inverse squared width of the Gaussian clusters. 



